The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
In [1]:
# Import required libraries
from tpot import TPOT
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np
In [2]:
#Load the data
Marketing=pd.read_csv('Data_FinalProject.csv')
Marketing.head(5)
Out[2]:
In [3]:
Marketing.groupby('loan').y.value_counts()
Out[3]:
In [4]:
Marketing.groupby(['loan','marital']).y.value_counts()
Out[4]:
The first and most important step in using TPOT on any data set is to rename the target class/response variable to class.
In [5]:
Marketing.rename(columns={'y': 'class'}, inplace=True)
At present, TPOT requires all the data to be in numerical format. As we can see below, our data set has 11 categorical variables which contain non-numerical values: job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome, class.
In [6]:
Marketing.dtypes
Out[6]:
We then check the number of levels that each of the five categorical variables have.
In [7]:
for cat in ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome' ,'class']:
print("Number of levels in category '{0}': \b {1:2.2f} ".format(cat, Marketing[cat].unique().size))
As we can see, contact and poutcome have few levels. Let's find out what they are.
In [8]:
for cat in ['contact', 'poutcome','class', 'marital', 'default', 'housing', 'loan']:
print("Levels for catgeory '{0}': {1}".format(cat, Marketing[cat].unique()))
We then code these levels manually into numerical values. For nan i.e. the missing values, we simply replace them with a placeholder value (-999). In fact, we perform this replacement for the entire data set.
In [9]:
Marketing['marital'] = Marketing['marital'].map({'married':0,'single':1,'divorced':2,'unknown':3})
Marketing['default'] = Marketing['default'].map({'no':0,'yes':1,'unknown':2})
Marketing['housing'] = Marketing['housing'].map({'no':0,'yes':1,'unknown':2})
Marketing['loan'] = Marketing['loan'].map({'no':0,'yes':1,'unknown':2})
Marketing['contact'] = Marketing['contact'].map({'telephone':0,'cellular':1})
Marketing['poutcome'] = Marketing['poutcome'].map({'nonexistent':0,'failure':1,'success':2})
Marketing['class'] = Marketing['class'].map({'no':0,'yes':1})
In [10]:
Marketing = Marketing.fillna(-999)
pd.isnull(Marketing).any()
Out[10]:
For other categorical variables, we encode the levels as digits using Scikit-learn's MultiLabelBinarizer and treat them as new features.
In [11]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
job_Trans = mlb.fit_transform([{str(val)} for val in Marketing['job'].values])
education_Trans = mlb.fit_transform([{str(val)} for val in Marketing['education'].values])
month_Trans = mlb.fit_transform([{str(val)} for val in Marketing['month'].values])
day_of_week_Trans = mlb.fit_transform([{str(val)} for val in Marketing['day_of_week'].values])
In [12]:
day_of_week_Trans
Out[12]:
Drop the unused features from the dataset.
In [13]:
marketing_new = Marketing.drop(['marital','default','housing','loan','contact','poutcome','class','job','education','month','day_of_week'], axis=1)
In [14]:
assert (len(Marketing['day_of_week'].unique()) == len(mlb.classes_)), "Not Equal" #check correct encoding done
In [15]:
Marketing['day_of_week'].unique(),mlb.classes_
Out[15]:
We then add the encoded features to form the final dataset to be used with TPOT.
In [16]:
marketing_new = np.hstack((marketing_new.values, job_Trans, education_Trans, month_Trans, day_of_week_Trans))
In [17]:
np.isnan(marketing_new).any()
Out[17]:
Keeping in mind that the final dataset is in the form of a numpy array, we can check the number of features in the final dataset as follows.
In [18]:
marketing_new[0].size
Out[18]:
Finally we store the class labels, which we need to predict, in a separate variable.
In [19]:
marketing_class = Marketing['class'].values
To begin our analysis, we need to divide our training data into training and validation sets. The validation set is just to give us an idea of the test set error. The model selection and tuning is entirely taken care of by TPOT, so if we want to, we can skip creating this validation set.
In [20]:
training_indices, validation_indices = training_indices, testing_indices = train_test_split(Marketing.index, stratify = marketing_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size
Out[20]:
After that, we proceed to calling the fit, score and export functions on our training dataset. An important TPOT parameter to set is the number of generations. Since our aim is to just illustrate the use of TPOT, we have set it to 5. On a standard laptop with 4GB RAM, it roughly takes 5 minutes per generation to run. For each added generation, it should take 5 mins more. Thus, for the default value of 100, total run time could be roughly around 8 hours.
In [21]:
tpot = TPOT(generations=5, verbosity=2)
tpot.fit(marketing_new[training_indices], marketing_class[training_indices])
In the above, 5 generations were computed, each giving the training efficiency of fitting model on the training set. As evident, the best pipeline is the one that has the CV score of 78.402%. The model that produces this result is one that fits a passive aggressive algorithm on the data set. Next, the testing error is computed for validation.
In [22]:
tpot.score(marketing_new[validation_indices], Marketing.loc[validation_indices, 'class'].values)
Out[22]:
In [23]:
tpot.export('tpot_marketing_pipeline.py')
In [ ]:
# %load tpot_marketing_pipeline.py
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR')
training_indices, testing_indices = train_test_split(tpot_data.index, stratify = tpot_data['class'].values, train_size=0.75, test_size=0.25)
result1 = tpot_data.copy()
# Perform classification with a passive aggressive classifier
pagr1 = PassiveAggressiveClassifier(C=0.81, loss="squared_hinge", fit_intercept=True, random_state=42)
pagr1.fit(result1.loc[training_indices].drop('class', axis=1).values, result1.loc[training_indices, 'class'].values)
result1['pagr1-classification'] = pagr1.predict(result1.drop('class', axis=1).values)
In [ ]:
In [ ]:
In [ ]: